Parallelize per-tile compression in streaming write #1531
Merged
brendancol merged 2 commits into xarray-contrib:main (May 9, 2026)
Conversation
The streaming tile-write path in write_streaming previously walked each segment's tiles serially, calling _compress_block inline. The non-streaming path at _writer.py:~568 already fans compression out to a ThreadPoolExecutor sized at os.cpu_count(), since zlib, zstd, LZW and LERC all release the GIL inside their C codecs.

This change mirrors that pattern inside the segment loop: build the tile arrays sequentially, submit compression to a per-segment thread pool sized min(n_seg_tiles, os.cpu_count()), then write the resulting buffers to the file sequentially. The file write stays serial, so the on-disk tile layout is unchanged.

Memory cost is bounded by the segment size: tiles_per_segment compressed buffers held briefly in RAM. For a 32-tile segment with ~50 KB compressed tiles that is ~1.6 MB. The streaming buffer cap already bounds segment size, so peak memory growth is small.

Measured on a 4096x4096 float32 deflate streaming write: 1.69 s serial-equivalent vs 0.27 s with the pool, a 6.2x speedup that matches the audit estimate.
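The per-segment pattern described above can be sketched as follows. This is a minimal illustration, not the module's code: `compress_segment_tiles`, the `seg_tile_arrs` payload list, and the use of `zlib.compress` as the codec are all stand-ins for the writer's internals.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_segment_tiles(seg_tile_arrs, compress=zlib.compress):
    """Compress one segment's tiles in parallel, preserving order.

    ``seg_tile_arrs`` is a list of bytes-like tile payloads; ``compress``
    stands in for the codec call (zlib/zstd/LZW/LERC), which releases
    the GIL in its C implementation so worker threads can overlap.
    """
    n_seg_tiles = len(seg_tile_arrs)
    if n_seg_tiles <= 1:
        # Single-tile segment: pool overhead isn't worth it.
        return [compress(t) for t in seg_tile_arrs]
    n_workers = min(n_seg_tiles, os.cpu_count() or 4)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves input order, so the caller can write the
        # returned buffers sequentially and keep the on-disk tile layout.
        return list(pool.map(compress, seg_tile_arrs))
```

Because `Executor.map` yields results in submission order, the serial write phase needs no re-sorting step.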
Pull request overview
This PR improves GeoTIFF streaming writes (write_streaming) by parallelizing per-tile compression within each horizontal segment, mirroring the existing parallel compression approach used in the non-streaming tiled writer. This targets compress-bound streaming writes where codecs release the GIL.
Changes:
- Parallelize per-segment tile compression in write_streaming using a ThreadPoolExecutor, then write compressed tiles sequentially to preserve on-disk order.
- Add a new test module covering round-trip correctness, observable parallelism, and a performance regression guard for streaming writes.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| xrspatial/geotiff/_writer.py | Builds per-tile arrays per segment, compresses them in parallel via a thread pool, then writes buffers sequentially. |
| xrspatial/geotiff/tests/test_streaming_write_parallel.py | Adds correctness/parallelism/perf-guard tests for the new streaming parallel compression behavior. |
Comment on lines +138 to +145

```python
path = str(tmp_path / 'parallel_check.tif')
to_geotiff(dask_da, path,
           compression='deflate', tile_size=256)

assert len(seen_threads) > 1, (
    f"Expected >1 worker threads in streaming compress, "
    f"saw {len(seen_threads)}: {seen_threads}")
```
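The assertion above depends on recording which worker threads ran the compression calls. A standalone sketch of that mechanism, using `zlib` directly rather than the monkeypatched `_compress_block` from the real test:

```python
import os
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor

seen_threads = set()


def recording_compress(data):
    # Record which worker thread ran this call, then delegate to the
    # real codec (zlib here; the PR's test patches _compress_block).
    seen_threads.add(threading.get_ident())
    return zlib.compress(data)


tiles = [os.urandom(1 << 14) for _ in range(32)]
with ThreadPoolExecutor(max_workers=4) as pool:
    bufs = list(pool.map(recording_compress, tiles))

# With 32 tiles spread across 4 workers, more than one thread normally
# appears in seen_threads, though scheduling makes this probabilistic
# rather than guaranteed -- hence the review's cpu_count monkeypatch.
```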
Comment on lines +163 to +173

```python
t0 = time.perf_counter()
to_geotiff(dask_da, path, compression='deflate', tile_size=256)
elapsed = time.perf_counter() - t0

# Sanity check the file was written.
result = open_geotiff(path)
assert result.shape == shape

assert elapsed < 5.0, (
    f"Streaming 4096x4096 deflate write took {elapsed:.2f}s, "
    f"expected <5s (regression guard)")
```
Comment on lines +180 to +195

```python
def test_streaming_write_with_single_thread_fallback(tmp_path):
    """If write_streaming exposes a ``threads`` kwarg, callers can opt
    into deterministic single-thread compress. Currently it does not -
    skip so the test stays as a placeholder for when a knob is added.
    """
    sig = inspect.signature(writer_mod.write_streaming)
    if 'threads' not in sig.parameters:
        pytest.skip(
            "write_streaming has no 'threads' kwarg yet; skipping "
            "deterministic single-thread fallback test")

    da = _make_dataarray((400, 400), dtype=np.float32)
    dask_da = da.chunk({'y': 200, 'x': 200})
    path = str(tmp_path / 'threads1.tif')
    to_geotiff(dask_da, path, compression='deflate', threads=1)
    result = open_geotiff(path)
```
Comment on lines +1442 to +1448

```python
from concurrent.futures import (
    ThreadPoolExecutor)
n_workers = min(n_seg_tiles,
                os.cpu_count() or 4)
with ThreadPoolExecutor(
        max_workers=n_workers) as pool:
    futures = [
```
Comment on lines +1429 to +1431

```python
# Memory cost is bounded by the segment size
# (tiles_per_segment compressed buffers held in RAM
# before the sequential write phase below).
```
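A quick check of the bound that comment describes, using the example figures from the PR description (32 tiles, ~50 KB each):

```python
# Transient memory per segment: compressed buffers held in RAM before
# the sequential write phase (figures from the PR description).
tiles_per_segment = 32
compressed_tile_bytes = 50 * 1024     # ~50 KB per compressed tile
peak_bytes = tiles_per_segment * compressed_tile_bytes
print(peak_bytes / 2**20)             # 1.5625 MiB, the "~1.6 MB" in the text
```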
Five findings, all acted on:

- Hoist the ThreadPoolExecutor over the entire tiled streaming write rather than recreating it per segment. For wide rasters with many horizontal segments, the per-segment construction paid the thread-startup cost on every stripe and offset the parallel speedup. Now one pool spans every (tile_row, segment) iteration.
- Skip the pool entirely when compression is uncompressed (COMPRESSION_NONE has no C-level work to release the GIL on) or when the host has only one usable core. Both cases fall through to the existing serial path.
- Update the per-segment memory-cost comment to mention BOTH the uncompressed seg_tile_arrs and the compressed buffers, since both are held simultaneously while futures resolve.
- test_streaming_write_parallelism_observed now monkeypatches os.cpu_count to return 4 so the assertion stays deterministic on single-core CI containers. Without the patch the pool would size to 1 and the test would fail for environment reasons.
- The wall-clock perf tripwire is gated behind XRSPATIAL_RUN_PERF_TESTS=1 to avoid CI flakiness on shared/throttled runners; deterministic parallel-branch coverage already lives in the parallelism-observed test.
- Drop the test_streaming_write_with_single_thread_fallback placeholder. It gated on a `threads=` kwarg that doesn't exist and isn't planned; reviewing the gate-vs-call path for a non-existent kwarg adds noise to future readers.
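The first two findings combined might look like the following sketch. `COMPRESSION_NONE`, the nested segment structure, and the zlib codec are illustrative stand-ins, not the module's real symbols.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor

COMPRESSION_NONE = 'none'  # assumed sentinel; the real flag may differ


def compress_segments(segments, compression='deflate'):
    """One pool spans all segments (hoisted, per the first finding);
    serial fallback when uncompressed or on a single-core host."""
    codec = bytes if compression == COMPRESSION_NONE else zlib.compress
    if compression == COMPRESSION_NONE or (os.cpu_count() or 1) <= 1:
        # No GIL-releasing C work, or nothing to overlap: stay serial.
        return [[codec(t) for t in seg] for seg in segments]
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        # Pool outlives every segment, so thread startup is paid once.
        return [list(pool.map(codec, seg)) for seg in segments]
```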
Summary
Performance audit finding P4: the streaming tile-write path in xrspatial/geotiff/_writer.py (write_streaming, around line 1403) compressed tiles serially inside each horizontal segment, while the non-streaming write path at _writer.py:~568 already fans compression out to a ThreadPoolExecutor sized at os.cpu_count(). Since zlib, zstd, LZW and LERC release the GIL inside their C codecs, the streaming path was leaving cores idle on compress-bound writes.

This change mirrors the non-streaming pool pattern inside the segment loop:

- Submit each tile's _compress_block to a ThreadPoolExecutor(max_workers=min(n_seg_tiles, os.cpu_count())).
- Collect the results in tc order and write them to the file sequentially, so the on-disk tile layout is unchanged.

The public API of write_streaming is untouched. The serial path is kept for n_seg_tiles <= 1 so single-tile segments avoid the pool overhead.

Memory cost
Peak transient memory grows by tiles_per_segment compressed tiles held in a list before the sequential write phase. For a typical 32-tile segment of ~50 KB compressed tiles that is ~1.6 MB. The pre-existing streaming_buffer_bytes cap already bounds tiles_per_segment, so the worst-case memory footprint is bounded by user configuration.

Measured speedup
4096x4096 float32 deflate streaming write on a local box:

- Serial-equivalent (os.cpu_count=1): 1.69 s
- Parallel pool: 0.27 s

About 6.2x, matching the audit estimate.
Test plan

- New xrspatial/geotiff/tests/test_streaming_write_parallel.py:
  - Round-trip correctness across dtype (float32, uint16, uint8) x compression (deflate, zstd, lzw, none) x tile size, with a forced small streaming_buffer_bytes so the multi-segment branch is exercised. Bit-exact vs eager write and vs the source array.
  - Monkeypatch _compress_block to record threading.get_ident() per call and confirm more than one worker thread participates.
  - A placeholder test that skips while write_streaming does not expose a threads kwarg.
- pytest xrspatial/geotiff/tests/test_streaming_write.py xrspatial/geotiff/tests/test_writer.py xrspatial/geotiff/tests/test_writer_matrix.py -> 123 passed.